1. PREPARE

During the final week of each unit, we will complete a “case study” to illustrate how Learning Analytics methods and techniques can be applied to address research questions of interest, create useful data products, and conduct reproducible research.

Each case study is structured around a basic research workflow modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):

Figure 2.2 Steps of Data-Intensive Research Workflow

For Unit 1, we will focus on analysis of open-ended survey items from an evaluation of the North Carolina Department of Public Instruction (NCDPI) online professional development offered as part of the state’s Race to the Top efforts. For more information about the Race to the Top evaluation work, visit <https://cerenc.org/rttt-evaluation/>.
Our focus will be on getting our text “tidy” so we can perform some basic word counts, look at words that occur at a higher rate in a group of documents, and examine words that are unique to those document groups. Specifically, the Unit 1 Walkthrough will cover the following workflow topics:

  1. Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. You’ll also need to set up a “Project” for our Unit 1 walkthrough.
  2. Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In section 2 we focus on reading, reducing, and tidying our data.
  3. Explore: In section 3, we use simple summary statistics, more sophisticated approaches like term frequency-inverse document frequency (tf-idf), and basic data visualization to explore our data and see what insight it provides in response to our question.
  4. Model: We will not model our data until Unit 3, when we learn about topic models.
  5. Communicate: Next week, we will develop “data products” to communicate our findings and insights.

1a. Review the Literature

Our Unit Case Study is guided by a well-cited publication by

Full Paper (AERA Open)

Abstract

Earlier studies have suggested that higher education institutions could harness the predictive power of Learning Management System (LMS) data to develop reporting tools that identify at-risk students and allow for more timely pedagogical interventions. This paper confirms and extends this proposition by providing data from an international research project investigating which student online activities accurately predict academic achievement. Analysis of LMS tracking data from a Blackboard Vista-supported course identified 15 variables demonstrating a significant simple correlation with student final grade… Moreover, network analysis of course discussion forums afforded insight into the development of the student learning community by identifying disconnected students, patterns of student-to-student communication, and instructor positioning within the network. This study affirms that pedagogically meaningful information can be extracted from LMS-generated student tracking data, and discusses how these findings are informing the development of a dashboard-like reporting tool for educators that will extract and visualize real-time data on student engagement and likelihood of success.

Data Sources

Similar to the data we’ll be using later in this case study, Rosenberg et al. used publicly accessible data from Twitter collected using the Full-Archive Twitter API and the rtweet package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, along with all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss,” “next generation science standard/s,” “next gen science standard/s.”

Analysis

Also similar to what we’ll demonstrate in Lab 3, the authors determined Tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2019) used it to explore the sentiment of CCSS-related posts. We’ll be using the AFINN sentiment lexicon which also assigns words in a tweet to two 5-point scales, in addition to exploring some other sentiment lexicons to see if they produce similar results.

The authors also used the lme4 package in R to run a mixed effects (or multi-level) model to determine whether sentiment changes over time and differs between teachers and non-teachers. We won’t look at the relationships between tweet sentiment, time, and teachers in these labs, but we will take a look at the correlation between words within tweets in TM Learning Lab 2.

Summary of Key Findings

  1. Contrasting with sentiment about CCSS, sentiment about the NGSS science education reform effort is overwhelmingly positive, with approximately 9 positive tweets for every negative tweet.
  2. Teachers were more positive than non-teachers, and sentiment became substantially more positive over the ten years of NGSS-related posts.
  3. Differences between the context of the tweets were small, but those that did not include the #NGSSchat hashtag became more positive over time than those posts that did include the hashtag.
  4. Individuals posted more tweets during #NGSSchat chats, and the sentiment of their posts was more positive, suggesting that while the context of individual tweets has a small effect (with posts not including the hashtag becoming more positive over time), the effect upon individuals of being involved in the #NGSSchat was positive.

Finally, you can watch Dr. Rosenberg provide a quick 3-minute overview of this work at <https://stanford.app.box.com/s/i5ixkj2b8dyy8q5j9o5ww4nafznb497x>

1b. Define Questions

One overarching question that Silge and Robinson (2018) identify as central to text mining and natural language processing, and that we’ll explore throughout the text mining labs this year, is:

How do we quantify what a document or collection of documents is about?

The questions guiding the Rosenberg et al. study attempt to quantify public sentiment around the NGSS and how that sentiment changes over time. Specifically, they asked:

  1. What is the public sentiment expressed toward the NGSS?
  2. How does sentiment for teachers differ from non-teachers?
  3. How do tweets posted to #NGSSchat differ from those without the hashtag?
  4. How does participation in #NGSSchat relate to the public sentiment individuals express?
  5. How does public sentiment vary over time?

For our first lab on text mining in STEM education, we’ll use approaches similar to those used by the authors cited above to better understand public discourse surrounding these standards, particularly as they relate to STEM education. We will also try to gauge public sentiment around the NGSS by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets. Specifically, in the next four learning labs we’ll attempt to answer the following questions:

  1. What are the most frequent words or phrases used in reference to tweets about the CCSS and NGSS?
  2. What words and hashtags commonly occur together?
  3. How does sentiment for NGSS compare to sentiment for CCSS?

1c. Load Libraries

tidyverse 📦

As noted in our Getting Started activity, R uses “packages,” add-ons that enhance its functionality. One package that we’ll be using extensively is {tidyverse}. The {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data, all of which share an underlying design philosophy, grammar, and data structures. These shared features are sometimes called “tidy data principles.”

Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library, as well as the {here} package introduced in previous labs and the {janitor} package.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at /Volumes/GoogleDrive/My Drive/College of Ed/Learning Analytics/Courses/ECI 586 Intro to LA/GitHub/eci-586
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Again, don’t worry if you saw a number of messages: those probably mean that the tidyverse loaded just fine. Any conflicts you may have seen mean that functions in the packages you loaded have the same name as functions in other packages, and R will default to the function from the most recently loaded package unless you specify otherwise.
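
To make the conflict messages above more concrete, here is a minimal sketch of how the package::function() syntax tells R exactly which version of a function to use. The small scores data frame is made up purely for illustration and is not part of the course data.

```r
library(dplyr)

# Hypothetical data for illustration only
scores <- tibble(student = c("a", "b", "c"),
                 points = c(10, 25, 40))

# Explicitly call the {dplyr} version of filter(), which keeps matching rows
high_scores <- dplyr::filter(scores, points > 20)
high_scores

# Explicitly call the {stats} version, which instead applies a linear
# (here, moving-average) filter to a numeric series
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Because {dplyr} is loaded after {stats}, typing filter() on its own would use the {dplyr} version; the prefix is only needed when you want to be explicit or want the masked function.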

2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). In Part 2, we focus on the following:

  1. Import Data. In this section, we introduce the read_csv() function for working with CSV files and revisit some key functions for inspecting our data.

  2. Tidy Data. We revisit some key {dplyr} functions like mutate() for creating or changing variables, and introduce the separate() and clean_names() functions for getting our data nice and tidy.

  3. Join Data. We conclude our data wrangling by introducing join functions, such as left_join(), for merging our processed files into a single data frame for analysis.

2a. Import Data

Education data are stored in all sorts of different file formats and structures. In this course, we’ll discuss several of these common formats, how to import your data into R, and how to transform your data into other data formats such as network objects required for social network analysis in Unit 3. In this case study, we’ll focus on working with comma-separated values (CSV) files.

Similar to spreadsheet formats like Excel and Google Sheets, CSVs allow us to store rectangular data frames, but in a much simpler plain-text format, where all the important information in the file is represented by text. Note that “text” here refers to numbers, letters, and symbols you can type on your keyboard. In Tidyverse Skills for Data Science, Wright et al. (2021) note that the advantage of CSVs is

… that there are no workbooks or metadata making it difficult to open these files. CSVs are flexible files and are thus the preferred storage method for tabular data for many data scientists.
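
To see what “plain text” means in practice, the following sketch writes a tiny data frame to a temporary CSV file, prints the raw lines of that file, and then reads it back in. The demo data frame and its values are hypothetical and used only for illustration; the temporary file path comes from base R’s tempfile().

```r
library(readr)
library(tibble)

# Hypothetical two-row data frame for illustration only
demo <- tibble(student_id = c(1, 2),
               course_id = c("FrScA-S216-02", "OcnA-S116-01"))

# Write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write_csv(demo, csv_path)

# The file is just lines of text: a header row, then one line per record
raw_lines <- readLines(csv_path)
raw_lines

# Reading it back returns a rectangular data frame again
demo_again <- read_csv(csv_path, show_col_types = FALSE)
demo_again
```

Here show_col_types = FALSE simply quiets the column specification message that read_csv() otherwise prints.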

Data Source #1: Log Data

Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings. In education, an increasingly common source of log-trace data is that generated from interactions with LMS and other digital tools.

The data we will use is a summary type of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are even more complex sources of log-trace data out there (e.g., time stamps associated with when students started and stopped accessing the course). We’ll name this data set time_spent, to help us to quickly recollect what function it serves in this analysis.

time_spent <- read_csv(here("unit-1", "data", "log-data.csv"))
## Rows: 716 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): course_id, gender, enrollment_reason, enrollment_status
## dbl (2): student_id, time_spent
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Type time_spent into the console (below this window) and then hit return/enter. You should see a printed summary of this data frame. What do you notice about this data? What do you wonder? Add a note (or more—you can type return/enter after a bullet point to add another) on your noticings and wonderings here:

Data Source #2: Academic Achievement Data

In addition to the time_spent data we loaded, we’ll explain a bit more about the second data set, which contains academic achievement data.

Academic achievement data is (obviously) a very common form of data in education. In this learning lab, we’ll use both the sum of the points students earned and the number of points possible to compute the percentage of points they earned in the course, a measure comparable to (but likely a little different from, based on teachers’ grading policies) their final grade. We’ll use this in the second learning lab.

We’ll load the data in the same way as earlier:

gradebook <- read_csv(here("unit-1", "data", "gradebook-summary.csv"))
## Rows: 717 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): course_id
## dbl (3): student_id, total_points_possible, total_points_earned
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You may choose to type the name gradebook into the console window and then (like earlier) press enter/return to view a summary of the data this name points to.
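
As a preview of the percentage calculation described above, here is a minimal sketch using a small made-up data frame that borrows the column names total_points_possible and total_points_earned from the gradebook file. The values and the name percentage_earned are assumptions for illustration, not the course data or the lab’s final code.

```r
library(dplyr)

# Hypothetical gradebook-like data for illustration only
gradebook_demo <- tibble(student_id = c(1, 2, 3),
                         total_points_possible = c(200, 200, 150),
                         total_points_earned = c(180, 150, 120))

# Percentage of points earned = 100 * earned / possible
gradebook_demo <- gradebook_demo %>%
  mutate(percentage_earned = 100 * total_points_earned / total_points_possible)

gradebook_demo
```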

Data Source #3: Self-Report Survey

The third data source is a self-report survey. These data were collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. These were chosen for their alignment with one way to think about students’ motivation: the extent to which they expect to do well (corresponding to their perceived competence) and their value for what they are learning (corresponding to their interest and utility value). We’ll use this in the third learning lab.

We’ll read this file, named survey.csv, which is—like our other data files—stored in the /data directory.

Your Turn

In the code below, read the survey.csv file. You can use the code above as a template. Assign the output from the read_csv() function to a new object named survey.

survey <- read_csv(here("unit-1", "data", "survey.csv"))
## Rows: 662 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): student_ID, course_ID, subject, semester, section
## dbl  (18): int, val, percomp, tv, q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, p...
## dttm  (3): date.x, date.y, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Hint: To “assign the output from the read_csv() function” to the name survey, consider how, in the code chunk above this “Your Turn” code chunk, we assigned the output from the read_csv() function to the name gradebook.

After reading the data, let’s continue the practice of looking at our data. Type survey into the console to take a look at the data: Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peek at the data helps us begin to formulate answers to these questions and is an important step in any data analysis, especially as we prepare for what we are going to do.

Add one or more of the things you notice or wonder about the data here:

View Data

Once your data is in R, there are many different ways you can view it. Give each of the following a try:

# enter the name of your data frame and view directly in the console or a code chunk
survey
## # A tibble: 662 × 26
##    student_ID course_ID subject semester section   int   val percomp    tv    q1
##    <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
##  1 43146      FrScA-S2… FrScA   S216     02        4.2  3.67     4    3.86     4
##  2 44638      OcnA-S11… OcnA    S116     01        4    3        3    3.57     4
##  3 47448      FrScA-S2… FrScA   S216     01        4.2  3        3    3.71     5
##  4 47979      OcnA-S21… OcnA    S216     01        4    3.67     2.5  3.86     4
##  5 48797      PhysA-S1… PhysA   S116     01        3.8  3.67     3.5  3.71     4
##  6 51943      FrScA-S2… FrScA   S216     03        3.8  3.67     3.5  3.71     4
##  7 52326      AnPhA-S2… AnPhA   S216     01        3.6  4        3    4        4
##  8 52446      PhysA-S1… PhysA   S116     01        4.2  3.67     3    4        4
##  9 53447      FrScA-S1… FrScA   S116     01        3.8  2        3    3        5
## 10 53475      FrScA-S2… FrScA   S216     01        4.8  3.33     4    4.14     5
## # … with 652 more rows, and 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>,
## #   q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>,
## #   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
## #   date.y <dttm>, date <dttm>
# view your data frame transposed so you can see every column and the first few entries
glimpse(survey) 
## Rows: 662
## Columns: 26
## $ student_ID   <chr> "43146", "44638", "47448", "47979", "48797", "51943", "52…
## $ course_ID    <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01", "OcnA-S…
## $ subject      <chr> "FrScA", "OcnA", "FrScA", "OcnA", "PhysA", "FrScA", "AnPh…
## $ semester     <chr> "S216", "S116", "S216", "S216", "S116", "S216", "S216", "…
## $ section      <chr> "02", "01", "01", "01", "01", "03", "01", "01", "01", "01…
## $ int          <dbl> 4.2, 4.0, 4.2, 4.0, 3.8, 3.8, 3.6, 4.2, 3.8, 4.8, 4.6, 3.…
## $ val          <dbl> 3.666667, 3.000000, 3.000000, 3.666667, 3.666667, 3.66666…
## $ percomp      <dbl> 4.0, 3.0, 3.0, 2.5, 3.5, 3.5, 3.0, 3.0, 3.0, 4.0, 4.0, 3.…
## $ tv           <dbl> 3.857143, 3.571429, 3.714286, 3.857143, 3.714286, 3.71428…
## $ q1           <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 5, 5, 4, 3, 4, 4, 5, 4, 4, 5, 5, …
## $ q2           <dbl> 4, 2, 3, 3, 4, 4, 4, 4, 2, 4, 5, 1, 3, 2, 5, 4, 4, 4, 3, …
## $ q3           <dbl> 4, 2, 3, 2, 3, 3, 4, 3, 3, 4, 4, 3, 3, 2, 4, 3, 3, 4, 4, …
## $ q4           <dbl> 5, 4, 4, 4, 4, 4, 2, 4, 4, 5, 4, 4, 4, 5, 5, 4, 4, 5, 5, …
## $ q5           <dbl> 4, 4, 4, 4, 4, 3, 4, 4, 4, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, …
## $ q6           <dbl> 4, 4, 3, 4, 4, 3, 4, 4, 2, 3, 5, 3, 3, 3, 5, 4, 4, 4, 5, …
## $ q7           <dbl> 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, …
## $ q8           <dbl> 4, 4, 4, 4, 4, 4, 4, 5, 4, 5, 5, 3, 4, 5, 5, 4, 5, 5, 5, …
## $ q9           <dbl> 3, 3, 3, 4, 3, 4, 4, 3, 2, 3, 4, 2, 3, 1, 5, 3, 3, 4, 3, …
## $ q10          <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 2, 4, 5, 3, 4, 3, 5, 4, 5, 5, 5, …
## $ date.x       <dttm> 2016-02-02 18:44:00, 2015-09-09 13:41:00, 2016-01-28 14:…
## $ post_int     <dbl> NA, NA, NA, NA, NA, NA, NA, 3.50, 3.75, NA, 5.00, NA, NA,…
## $ post_uv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.666667, 2.000000, NA, 4.666…
## $ post_tv      <dbl> NA, NA, NA, NA, NA, NA, NA, 3.571429, 3.000000, NA, 4.857…
## $ post_percomp <dbl> NA, NA, NA, NA, NA, NA, NA, 3.5, 3.0, NA, 4.0, NA, NA, 3.…
## $ date.y       <dttm> NA, NA, NA, NA, NA, NA, NA, 2016-01-02 00:41:00, 2015-10…
## $ date         <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# look at just the first six entries
head(survey) 
## # A tibble: 6 × 26
##   student_ID course_ID  subject semester section   int   val percomp    tv    q1
##   <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1 43146      FrScA-S21… FrScA   S216     02        4.2  3.67     4    3.86     4
## 2 44638      OcnA-S116… OcnA    S116     01        4    3        3    3.57     4
## 3 47448      FrScA-S21… FrScA   S216     01        4.2  3        3    3.71     5
## 4 47979      OcnA-S216… OcnA    S216     01        4    3.67     2.5  3.86     4
## 5 48797      PhysA-S11… PhysA   S116     01        3.8  3.67     3.5  3.71     4
## 6 51943      FrScA-S21… FrScA   S216     03        3.8  3.67     3.5  3.71     4
## # … with 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## #   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
## #   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
## #   date <dttm>
# or the last six entries
tail(survey) 
## # A tibble: 6 × 26
##   student_ID course_ID  subject semester section   int   val percomp    tv    q1
##   <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1 19         AnPhA-S21… AnPhA   S217     02        4.2  5        5    4.5      5
## 2 42         FrScA-S21… FrScA   S217     01        4    4        4    4        4
## 3 52         FrScA-S21… FrScA   S217     03        4.4  2.67     3.5  3.75     4
## 4 57         FrScA-S21… FrScA   S217     01        4.4  2.33     2.5  3.62     5
## 5 72         FrScA-S21… FrScA   S217     01        5    3        4    4.25     5
## 6 80         FrScA-S21… FrScA   S217     01        3.6  2.33     3    3.12     4
## # … with 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
## #   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
## #   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
## #   date <dttm>
# view the names of your variables or columns
names(survey) 
##  [1] "student_ID"   "course_ID"    "subject"      "semester"     "section"     
##  [6] "int"          "val"          "percomp"      "tv"           "q1"          
## [11] "q2"           "q3"           "q4"           "q5"           "q6"          
## [16] "q7"           "q8"           "q9"           "q10"          "date.x"      
## [21] "post_int"     "post_uv"      "post_tv"      "post_percomp" "date.y"      
## [26] "date"
# or view in source pane
View(survey)

Yes, the “V” is capitalized—very unusual for an R function. Because this function is a bit atypical in more ways than one, I have two recommendations concerning its use:

  • Use it strictly in the console. Because it opens a new viewing window, including it in an R Markdown script can cause issues when “knitting” or rendering an HTML (or PDF) report. Hence I have included the eval = FALSE argument in the code chunk so it is not run when you knit your document.

  • Close the viewer window that opens once you have viewed the data. Keeping it open can clutter your work space a bit and can lead to confusion about what data frame it was you viewed.

2b. Tidy Data

Intro…

Process Log Data

Earlier, we loaded time_spent, which contains information on the number of minutes that students spent on the course, as well as other variables, particularly course_id.

Information about the course subject, semester, and section is stored in a single variable: course_id. This format of data storage is not ideal, nor is it very “tidy.” If we instead give each piece of information its own column, we’ll have more opportunities for later analysis. We’ll use a function called separate() to do this.

First, let’s practice with a small data set. We’ll create it directly in R; run the code below to do that (and to assign the name df to the dataset).

df <- tibble(course_var = c("Fall - Chemistry", 
                            "Fall - Earth Science", 
                            "Spring - Forensic Science",
                            "Spring - Earth Science",
                            "Spring - Biology"))

df
## # A tibble: 5 × 1
##   course_var               
##   <chr>                    
## 1 Fall - Chemistry         
## 2 Fall - Earth Science     
## 3 Spring - Forensic Science
## 4 Spring - Earth Science   
## 5 Spring - Biology

Print df to the console. You should see a single variable, course_var, with five rows.

In this (very small) data frame, information about both the semester and the course is encoded within the same variable. The separate() function has two primary arguments, one each for:

  1. the variable you want to separate
  2. the names of the new variables to create

Below, see how using course_var for #1 and c("semester", "course") for #2 can separate the semester and course data into two separate variables:

df %>% 
  separate(course_var, c("semester", "course"))
## Warning: Expected 2 pieces. Additional pieces discarded in 3 rows [2, 3, 4].
## # A tibble: 5 × 2
##   semester course   
##   <chr>    <chr>    
## 1 Fall     Chemistry
## 2 Fall     Earth    
## 3 Spring   Forensic 
## 4 Spring   Earth    
## 5 Spring   Biology
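
The warning in the output above appears because, by default, separate() splits on any run of non-alphanumeric characters. The space inside “Earth Science” is therefore treated as a separator too, and the extra piece (“Science”) is discarded. One possible way to keep multi-word course names, sketched here on the assumption that we only want to split on the “ - ” delimiter, is to set the sep argument explicitly. The names df_demo and df_sep are used just for this illustration.

```r
library(dplyr)
library(tidyr)

# Same values as df above, recreated so this sketch is self-contained
df_demo <- tibble(course_var = c("Fall - Chemistry",
                                 "Fall - Earth Science",
                                 "Spring - Forensic Science",
                                 "Spring - Earth Science",
                                 "Spring - Biology"))

# Split only where " - " occurs, so spaces inside course names are kept
df_sep <- df_demo %>%
  separate(course_var, c("semester", "course"), sep = " - ")

df_sep
```

With this change, each course name should be preserved in full, and no pieces are discarded.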

Next, let’s try something slightly different. Here, we have a data frame with a variable that encodes three pieces of information within the same variable: the year, semester, and subject. There are also a few other differences.

df2 <- tibble(course_variable = c("19-Fall-Algebra I", 
                                  "20-Fall-Algebra II", 
                                  "20-Spring-Algebra I",
                                  "20-Spring-Algebra II",
                                  "21-Fall-Algebra I"))
df2
## # A tibble: 5 × 1
##   course_variable     
##   <chr>               
## 1 19-Fall-Algebra I   
## 2 20-Fall-Algebra II  
## 3 20-Spring-Algebra I 
## 4 20-Spring-Algebra II
## 5 21-Fall-Algebra I

Your Turn

Can you separate the variable in the above data frame not into two, but rather three, new variables? Below is some template code.

df2 %>% 
  separate(course_variable,
           c("year", "semester", "subject"))
## Warning: Expected 3 pieces. Additional pieces discarded in 5 rows [1, 2, 3, 4,
## 5].
## # A tibble: 5 × 3
##   year  semester subject
##   <chr> <chr>    <chr>  
## 1 19    Fall     Algebra
## 2 20    Fall     Algebra
## 3 20    Spring   Algebra
## 4 20    Spring   Algebra
## 5 21    Fall     Algebra

Hint: Try to modify the code from above (in which you separated course_var into two variables) based on a) the name of the variable in df2 and b) adding the name for the third new variable you wish to create.
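
Notice the warning in the output above: because the default separator also splits on the space in “Algebra I,” the distinction between Algebra I and Algebra II is lost. One possible remedy, offered here as a sketch rather than the required answer, is the extra = "merge" argument of separate(), which splits at most as many times as there are new variable names and keeps whatever remains in the last one. The names df2_demo and df2_merged are used only for this illustration.

```r
library(dplyr)
library(tidyr)

# Same values as df2 above, recreated so this sketch is self-contained
df2_demo <- tibble(course_variable = c("19-Fall-Algebra I",
                                       "20-Fall-Algebra II",
                                       "20-Spring-Algebra I",
                                       "20-Spring-Algebra II",
                                       "21-Fall-Algebra I"))

# Split into three variables; anything after the second separator
# stays together in the last variable (subject)
df2_merged <- df2_demo %>%
  separate(course_variable,
           c("year", "semester", "subject"),
           extra = "merge")

df2_merged
```

Setting sep = "-" would be another reasonable choice here, since the hyphen is the only delimiter we actually want to split on.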

Your Turn

Let’s now return to our time_spent data frame. It is often helpful to take a look at the data before writing code.

Below, we will load time_spent and run the separate() function with the course_id variable to split up the subject, semester, and section so we can use them later on. In other words, whereas above we separated the variable course_variable, in the data set we’ll use here, we’ll separate the course_id variable.

time_spent %>%
  separate(course_id,
           c("subject", "semester", "section"))
## # A tibble: 716 × 8
##    student_id subject semester section gender enrollment_reason enrollment_stat…
##         <dbl> <chr>   <chr>    <chr>   <chr>  <chr>             <chr>           
##  1      60186 AnPhA   S116     01      M      Course Unavailab… Approved/Enroll…
##  2      66693 AnPhA   S116     01      M      Course Unavailab… Approved/Enroll…
##  3      66811 AnPhA   S116     01      F      Course Unavailab… Approved/Enroll…
##  4      66862 AnPhA   S116     01      F      Course Unavailab… Approved/Enroll…
##  5      67508 AnPhA   S116     01      F      Scheduling Confl… Approved/Enroll…
##  6      70532 AnPhA   S116     01      F      Learning Prefere… Approved/Enroll…
##  7      77010 AnPhA   S116     01      F      Learning Prefere… Approved/Enroll…
##  8      85249 AnPhA   S116     01      F      Course Unavailab… Approved/Enroll…
##  9      85411 AnPhA   S116     01      F      Scheduling Confl… Approved/Enroll…
## 10      85583 AnPhA   S116     01      F      Scheduling Confl… Approved/Enroll…
## # … with 706 more rows, and 1 more variable: time_spent <dbl>

There is one last key step—one that is likely to be a bit disorienting at first—that we’ll do next. Once we’ve processed the data how we would like, we have to assign, or save, the results back to the name for the data with which we have been working. This is done with the assignment operator, or the <- symbol. Copy the code you successfully ran in the chunk above to follow the assignment operator in the chunk below. In other words, write the code you wrote above, but assign the output back to time_spent.

time_spent <- time_spent %>%
  separate(course_id,
           c("subject", "semester", "section"))

We have made a habit of continually looking at our data after running code to ensure that the step worked as intended. Not in a code chunk, but rather in the console below, type the name of the data we have been working with to ensure that the course_id variable has been separated into three variables that correspond to the subject, semester, and section.

If those look good, let’s proceed to the next step. If something doesn’t look right, consider re-running the code chunks above, perhaps returning all the way to the first code chunk that you ran (to load the data) to ensure that the output is as you intended for it to be.
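
The “process, then assign back” pattern described above can be sketched on a small stand-in data frame. The name time_demo and its values are hypothetical; only the course_id format mirrors the real data. Comparing names() before and after is one quick way to confirm that the assignment took effect.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for time_spent, for illustration only
time_demo <- tibble(student_id = c(60186, 66693),
                    course_id = c("AnPhA-S116-01", "AnPhA-S116-01"),
                    time_spent = c(120, 90))

# Column names before processing
names_before <- names(time_demo)
names_before

# Process the data and assign the result back to the same name
time_demo <- time_demo %>%
  separate(course_id, c("subject", "semester", "section"))

# Column names after processing: course_id has been replaced
names_after <- names(time_demo)
names_after
```

One consequence of assigning back to the same name is that running the assignment a second time will fail, because course_id no longer exists. That is why re-running from the chunk that loads the data is a sensible way to recover.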

Mutating a column to change the time spent variable to represent hours

In the above code, you used separate() to create new variables based on an existing variable. While that function solves a specific problem (when there are effectively multiple variables combined in one), its use represents a fairly common pattern: you use a function to solve a problem, figure out how it works, check the output, and then assign the output back to the name of the data frame, after which you can proceed to the next step.

There are a lot of other functions like separate that help you to solve specific problems, and we’ll introduce many over the two-week institute - and will point you to resources that describe many more.

There are also functions that can serve as general purpose tools that can solve many problems; one of the most useful is mutate(), a function to create new variables in a data set. Specifically, we’ll use mutate() to create a new variable for the percentage of points each student earned; keep in mind as you work through these steps how so many parts of wrangling data involve either changing a variable or creating a new one. For these purposes, mutate() can be very helpful.

Let’s begin again with a small data set with two variables, var_a and var_b. Run the chunk below.

df3 <- tibble(var_a = c(30, 50, 30, 10, 30, 40, 40, 30, 20, 50),
              var_b = c(100, 90, 60, 70, 60, 80, 70, 50, 30, 20))

Next, print df3 to the console. You should see two numeric variables; imagine they represent points that students earned on a 50-point quiz and a 100-point test, respectively. There are a lot of things that you might wish to do with these variables. For instance, you may wish to sum them together. The code below does this.

df3 %>% 
  mutate(points_sum = var_a + var_b)
## # A tibble: 10 × 3
##    var_a var_b points_sum
##    <dbl> <dbl>      <dbl>
##  1    30   100        130
##  2    50    90        140
##  3    30    60         90
##  4    10    70         80
##  5    30    60         90
##  6    40    80        120
##  7    40    70        110
##  8    30    50         80
##  9    20    30         50
## 10    50    20         70

Your Turn

We can combine many mutate() functions together. Below, create a new variable (let’s call it points_proportion) that represents the proportion of the total points students could, potentially, earn. To do this, you can divide points_sum by the maximum possible points—150.

df3 %>% 
  mutate(points_sum = var_a + var_b) %>% 
  mutate(points_proportion = points_sum / 150)
## # A tibble: 10 × 4
##    var_a var_b points_sum points_proportion
##    <dbl> <dbl>      <dbl>             <dbl>
##  1    30   100        130             0.867
##  2    50    90        140             0.933
##  3    30    60         90             0.6  
##  4    10    70         80             0.533
##  5    30    60         90             0.6  
##  6    40    80        120             0.8  
##  7    40    70        110             0.733
##  8    30    50         80             0.533
##  9    20    30         50             0.333
## 10    50    20         70             0.467

Hint: Just like you can use the + symbol to add variables together, you can use the / symbol to divide a variable by another—or by a value, like 150!

After adding the above, you should see output that contains four variables: var_a, var_b, points_sum, which represents the sum of the points students earned, and points_proportion, which represents the proportion of the total points students earned.
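As a side note, the two mutate() steps can also be written as a single call, because mutate() lets later expressions refer to variables created earlier in the same call. The following is a minimal sketch that rebuilds df3 so it can run on its own; the name df3_one_call is only for illustration.

```r
library(dplyr)
library(tibble)

# Same example data as df3 above
df3 <- tibble(var_a = c(30, 50, 30, 10, 30, 40, 40, 30, 20, 50),
              var_b = c(100, 90, 60, 70, 60, 80, 70, 50, 30, 20))

# Create both new variables in one mutate() call;
# points_proportion can use points_sum because it is defined first
df3_one_call <- df3 %>%
  mutate(points_sum = var_a + var_b,
         points_proportion = points_sum / 150)

df3_one_call
```

Whether you chain two mutate() calls or use one is mostly a matter of readability; the resulting columns are the same.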

Your Turn

Let’s now process the time_spent variable. Specifically, this variable currently represents the number of minutes that students spent on the course LMS. Add to the template of code below to create a new variable, time_spent_hours, that represents the number of hours that students spent on the course LMS.

time_spent %>% 
  mutate(time_spent_hours = time_spent / 60)
## # A tibble: 716 × 9
##    student_id subject semester section gender enrollment_reason enrollment_stat…
##         <dbl> <chr>   <chr>    <chr>   <chr>  <chr>             <chr>           
##  1      60186 AnPhA   S116     01      M      Course Unavailab… Approved/Enroll…
##  2      66693 AnPhA   S116     01      M      Course Unavailab… Approved/Enroll…
##  3      66811 AnPhA   S116     01      F      Course Unavailab… Approved/Enroll…
##  4      66862 AnPhA   S116     01      F      Course Unavailab… Approved/Enroll…
##  5      67508 AnPhA   S116     01      F      Scheduling Confl… Approved/Enroll…
##  6      70532 AnPhA   S116     01      F      Learning Prefere… Approved/Enroll…
##  7      77010 AnPhA   S116     01      F      Learning Prefere… Approved/Enroll…
##  8      85249 AnPhA   S116     01      F      Course Unavailab… Approved/Enroll…
##  9      85411 AnPhA   S116     01      F      Scheduling Confl… Approved/Enroll…
## 10      85583 AnPhA   S116     01      F      Scheduling Confl… Approved/Enroll…
## # … with 706 more rows, and 2 more variables: time_spent <dbl>,
## #   time_spent_hours <dbl>

Hint: Refer to the code you wrote above, being clear about a) what the name of the new variable you are creating is and b) how you will create this variable using division (by the number of minutes in an hour).

We used the above as a test bed to ensure that our code worked as intended. Once we are confident that we are creating the variable in the way we intend to, we can assign the output back to the data frame that time_spent refers to.

time_spent <- time_spent %>% 
  mutate(time_spent_hours = time_spent / 60)

Good work wrangling this dataset!

Process Gradebook Data

Now let’s process the gradebook data. In particular, we’ll separate the course_id variable in the same way we separated that variable in the log data, and we’ll also calculate a new variable representing the proportion of points students earned (out of the points possible to earn).

Let’s start with separating the course_id variable. Run the code in the next chunk to do this. If you named the three parts of the course ID differently than they’re named below (and saved the data you processed to use in this learning lab), be sure that these three variables are named identically; this is the key (pun intended!) to these variables joining correctly.

gradebook <- gradebook %>% 
  separate(course_id,
           c("subject", "semester", "section"))
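To see what separate() does in isolation, here is a small sketch on made-up data. The course IDs below and their "subject-semester-section" layout are assumptions for illustration, not values taken from the gradebook file. By default, separate() splits on any non-alphanumeric character, such as the hyphen.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical course IDs (an assumed format, for illustration only)
toy_courses <- tibble(course_id = c("AnPhA-S116-01", "FrScA-S216-02"))

# Split the single course_id column into three new character columns
toy_separated <- toy_courses %>%
  separate(course_id, c("subject", "semester", "section"))

toy_separated
```

Note that the section values stay character strings (for example "01"), which matters later because keys need to be stored the same way in every data frame you join.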

Next, we’ll mutate our data set to create a new column, one representing the proportion of points students earned. Let’s consider a data frame with example data, df4.

df4 <- tibble(var_a = c(8, 8, 7, 8, 9, 6, 8, 8, 7, 8),
              var_b = 9)

Note: To create df4, for var_a, we passed a vector that we created with the function c() that contains 10 values. Consider these to be the number of times that learners participated in an outside-of-school STEM club. Instead of passing another vector for var_b, we simply used the value 9, which represents the number of opportunities students had to participate in the outside-of-school STEM club. In this case, the value 9 is repeated for however many rows there are in the data frame. Thus, in the context of creating a data frame, var_b = 9 is the same as var_b = c(9, 9, 9, 9, 9, 9, 9, 9, 9, 9).

Since interpreting counts out of nine can be difficult, we may wish to create a variable for the proportion.
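If you would like to check the claim about repeated values yourself, the sketch below builds the data frame both ways and compares the var_b columns. The names df4_short and df4_long are hypothetical and used only here.

```r
library(tibble)

# var_b given as a single value, which tibble() repeats for every row
df4_short <- tibble(var_a = c(8, 8, 7, 8, 9, 6, 8, 8, 7, 8),
                    var_b = 9)

# var_b written out in full
df4_long <- tibble(var_a = c(8, 8, 7, 8, 9, 6, 8, 8, 7, 8),
                   var_b = c(9, 9, 9, 9, 9, 9, 9, 9, 9, 9))

# Compare the two var_b columns
identical(df4_short$var_b, df4_long$var_b)
```

Only single values are repeated this way; passing, say, two values for a ten-row data frame results in an error rather than silent repetition.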

Your Turn

After running the chunk above, print df4 to the console to get a sense of what the data frame consists of. To create a third variable that represents the proportion of STEM club activities students participated in, divide var_a by var_b.

df4 <- df4 %>% 
  mutate(proportion = var_a / var_b)

What happens if the output is different than you intended? That’s no problem! Re-run the code chunk above (in which we create df4) to have a blank slate with which to try again.

Your Turn

Your turn once more. This time, create a new variable—here, let’s name it proportion_earned—using the gradebook data. This will involve using the mutate function with the gradebook data, creating a new variable (proportion_earned) on the basis of the values of two existing variables:

  • total_points_possible

  • total_points_earned

Also, once your code is ready, you’ll need to assign the results back to gradebook. This is challenging as you’re starting from scratch with the code. However, good R programmers use other code (that they or others wrote!) often, so feel free to copy and paste code from other, similar problems to give yourself a head start.

gradebook <- gradebook %>% 
  mutate(proportion_earned = total_points_earned / total_points_possible)

Once the above step is complete, take another look at gradebook by printing it to the console or viewing it using a preferred method. There should now be seven columns, the six originally in the data and a new, seventh variable you’ve “mutated.”

Process Survey Data

Finally, let’s process our survey data that we imported earlier. First though, take another quick look at the data by typing survey into the console or using a preferred viewing method.

Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peek at the data helps us to begin to formulate answers to these and is an important step in any data analysis, especially as we prepare for what we are going to do.

Add one or more of the things you notice or wonder about the data here:

You may have noticed that student_ID is not formatted the same as student_id in our other files. This is important because in the next section, when we “join,” or merge, our data files, these variables will need to have identical names.

Fortunately, there is a handy function called clean_names() in the {janitor} package for standardizing variable names. Run the following code:

survey <- clean_names(survey)

Let’s take one more look at the data by typing its name into the console or using a method of your choice to check that the above function appeared to work; if it did, the names should be lower-case, and any symbols or spaces should now be replaced by an underscore (_).
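To get a feel for what clean_names() changes, here is a small sketch on a made-up data frame. The data frame toy_survey and its column names are hypothetical and only mimic the kind of inconsistency described above.

```r
library(janitor)
library(tibble)

# Hypothetical data with inconsistently formatted column names
toy_survey <- tibble(student_ID = c(1, 2),
                     `Course ID` = c("a", "b"))

# clean_names() converts the names to lower-case snake_case by default
toy_cleaned <- clean_names(toy_survey)

names(toy_survey)
names(toy_cleaned)
```

Only the column names change; the values in the data frame are left as they were.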

c. Joining the data

We’re now ready to join! At their core, joins involve operations on two data frames at the same time. This may seem useful only in certain cases, but consider the following data analysis tasks:

  • You have collected data from students from one of ten classrooms; at the same time, you have data on the teachers of those ten classes (five of whom tried out a new curriculum, and five of whom taught a “business-as-usual” curriculum)

  • You are studying the posts on Twitter and Pinterest of one of around 100 mathematics teachers

  • After working with a local school district, you collected survey responses from 100s of teachers who teach in one of approximately 25 elementary, middle, and high schools; you received data from the district on the characteristics of the schools, including how many students they serve and how many teachers work in them

In each of these cases, and many others like them, your single analysis involves multiple data files. While in some cases it is possible to analyze each data set individually, it is often useful (or necessary, depending upon your goal) to join these sources of data together. This is especially the case for learning analytics research, in which researchers and analysts often are interested in understanding teaching and learning through the lens of multiple data sources, including digital data, institutional records, and survey data, among other sources. In all of these cases, knowing how to promptly join together files, even files with tens of thousands or hundreds of thousands of rows, can be empowering.

Consider two example data frames. df5 contains a variable with three student names, name, and a variable for the number of STEM-related classes they have taken, n_stem_classes.

df6 contains a variable with four student names, name (like in df5), as well as another, different variable, for students’ self-reported interest in STEM topics, interest_in_stem, measured on a one-to-seven scale, with seven indicating higher levels of interest.

Run the code below and then type df5 and df6 in the console.

df5 <- tibble(name = c("Sheila", "Tayla", "Marcus"),
              n_stem_classes = c(4, 5, 6))

df5
## # A tibble: 3 × 2
##   name   n_stem_classes
##   <chr>           <dbl>
## 1 Sheila              4
## 2 Tayla               5
## 3 Marcus              6

df6 <- tibble(name = c("Tayla", "Marcus", "Sheila", "Vin"),
              interest_in_stem = c(4, 7, 6, 6))

df6
## # A tibble: 4 × 2
##   name   interest_in_stem
##   <chr>             <dbl>
## 1 Tayla                 4
## 2 Marcus                7
## 3 Sheila                6
## 4 Vin                   6

A key (pun intended) with joins is to consider what variable(s) will serve as the key. This is the variable to join by.

A key must have two characteristics; it is:

  • stored as the same type in both data frames (thus, you cannot join a key stored as a number to one stored as a character string unless you “coerce,” or change, one of them first)

  • present in both of the data frames you are joining.

To join two datasets, it is important that the key (or keys) on which you are joining the data is formatted identically. The key represents an identifier that is present in both of the data sets you are joining. For instance, you may have data collected from (or created about) the same students that are from two very different sources, such as a self-report survey of students and their teacher-assigned grades in class.

While some of the time it takes some thought to determine what the key is (or what the keys are—you can join on multiple keys!), in the above case, there is just one variable that meets both of the above characteristics.
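To make the point about identically formatted keys concrete, here is a minimal sketch with two made-up data frames in which the same student IDs are stored as numbers in one and as character strings in the other. The data frames grades_toy and survey_toy are hypothetical. In this situation dplyr refuses to join until the key is stored the same way in both, so we coerce it first.

```r
library(dplyr)
library(tibble)

# Hypothetical data: the same kind of ID stored in two different ways
grades_toy <- tibble(student_id = c(101, 102, 103),
                     grade = c(90, 85, 70))
survey_toy <- tibble(student_id = c("101", "102", "104"),
                     interest = c(5, 3, 4))

# Joining these as-is would stop with an error about incompatible types,
# so first coerce the numeric key to a character string
grades_toy <- grades_toy %>%
  mutate(student_id = as.character(student_id))

joined_toy <- full_join(grades_toy, survey_toy, by = "student_id")

joined_toy
```

Coercing to character is one reasonable choice for identifiers, since IDs are labels rather than quantities you would do arithmetic on.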

Your Turn

In the code below, enter the name of the variable that is the key within the quotation marks following by =. Then run the code chunk and note the output.

full_join(df5, df6, by = "name")
## # A tibble: 4 × 3
##   name   n_stem_classes interest_in_stem
##   <chr>           <dbl>            <dbl>
## 1 Sheila              4                6
## 2 Tayla               5                4
## 3 Marcus              6                7
## 4 Vin                NA                6

What do you notice about the output of full_join()? All observations are valid; consider how the output is similar to and different from df5 and df6, and add one or more notes below.

full_join() is one of a number of joins from which we can choose. full_join() is distinguished from the other joins by how it returns all of the rows in both of the data frames being joined. If a particular key is present in one of the data frames but not the other, the values for the variable in the data set for which the key is not present are simply recorded as missing (like in the above, where there is no value for the number of STEM classes Vin has taken).

There is one other join on which we’ll focus for now. That is left_join(), which differs from full_join() in that it returns all of the rows in the “left” data frame, the data frame named first in the function, but not all of the rows in the “right” data frame: it retains only the rows in the “right” data frame, the data frame named second in the function, that have a matching key. An example may help. Before running the code below, add the same key you added above.

Your Turn

left_join(df5, df6, by = "name")
## # A tibble: 3 × 3
##   name   n_stem_classes interest_in_stem
##   <chr>           <dbl>            <dbl>
## 1 Sheila              4                6
## 2 Tayla               5                4
## 3 Marcus              6                7

Different from the above, left_join() did not return all of the rows from both data frames, instead returning all of the rows in the “left” data frame (and those in the “right” data frame with a match).
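Because left_join() keeps every row of whichever data frame is named first, the order of the arguments matters. As a quick sketch, the code below recreates df5 and df6 and swaps their order; the name swapped is only for illustration.

```r
library(dplyr)
library(tibble)

# Same example data as above
df5 <- tibble(name = c("Sheila", "Tayla", "Marcus"),
              n_stem_classes = c(4, 5, 6))
df6 <- tibble(name = c("Tayla", "Marcus", "Sheila", "Vin"),
              interest_in_stem = c(4, 7, 6, 6))

# df6 is now the "left" data frame, so all four of its rows are kept;
# Vin has no match in df5, so n_stem_classes is missing for that row
swapped <- left_join(df6, df5, by = "name")

swapped
```

With the original order, left_join(df5, df6, by = "name"), Vin is dropped because that name appears only in the "right" data frame.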

Join Gradebook and Log Data

For now, we’re going to use a single join function, full_join(). In the code below, join gradebook and time_spent; type the names of those two data frames as arguments to the full_join() function in a similar manner as in the full_join() code above, and then run this code chunk. For now, don’t specify anything for the by = part of the function.

# join together the gradebook and time_spent
joined_data <- full_join(gradebook, time_spent)
## Joining, by = c("student_id", "subject", "semester", "section")
joined_data
## # A tibble: 830 × 12
##    student_id subject semester section total_points_possible total_points_earned
##         <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1      43146 FrScA   S216     02                       1217               1150 
##  2      44638 OcnA    S116     01                       1676               1384.
##  3      47448 FrScA   S216     01                       1232               1116 
##  4      47979 OcnA    S216     01                       1833               1493.
##  5      48797 PhysA   S116     01                       2225               1995.
##  6      51943 FrScA   S216     03                       1222                 70 
##  7      52326 AnPhA   S216     01                       1775               1519.
##  8      52446 PhysA   S116     01                       2225               2198 
##  9      53447 FrScA   S116     01                       1212               1173 
## 10      53475 FrScA   S116     02                       1212                  0 
## # … with 820 more rows, and 6 more variables: proportion_earned <dbl>,
## #   gender <chr>, enrollment_reason <chr>, enrollment_status <chr>,
## #   time_spent <dbl>, time_spent_hours <dbl>

You may notice a red message that says Joining, by = c("student_id", "subject", "semester", "section"). This is telling us that these files are being joined on the basis of all four of these variables matching in both data sets; in other words, for rows to be joined, they must match identically on all four of these variables.

This is related to not specifying anything for the by = part of the function; by default, full_join() (and left_join()) will consider any variables with identical names that are present in both data sets to be keys. But it’s generally better practice to specify the variables on which we are joining.
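To illustrate what it means for rows to match on every key, here is a small sketch with two made-up data frames that share two keys. The data frames points_toy and hours_toy, and the subject code "BioA", are hypothetical. A row is combined only when both student_id and subject agree; otherwise full_join() keeps it as a separate row with missing values.

```r
library(dplyr)
library(tibble)

# Hypothetical data with two keys: student_id and subject
points_toy <- tibble(student_id = c(1, 1, 2),
                     subject = c("BioA", "OcnA", "BioA"),
                     points = c(80, 90, 70))
hours_toy <- tibble(student_id = c(1, 2, 2),
                    subject = c("BioA", "BioA", "OcnA"),
                    hours = c(10, 12, 8))

# Naming the keys explicitly avoids the "Joining, by = ..." message
two_key_join <- full_join(points_toy, hours_toy,
                          by = c("student_id", "subject"))

two_key_join
```

Here student 1 in "OcnA" has points but no hours, and student 2 in "OcnA" has hours but no points, even though both student IDs appear in both data frames.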

Your Turn

In the code below, write your join like above, but add the by = c("student_id", "subject", "semester", "section") part to your code. You may notice that the red message from before no longer appears. This is generally a better practice because you know precisely on which variables your data sets are joining.

# join together the gradebook and time_spent
joined_data <- full_join(gradebook, time_spent, by = c("student_id", "subject", "semester", "section"))

joined_data
## # A tibble: 830 × 12
##    student_id subject semester section total_points_possible total_points_earned
##         <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1      43146 FrScA   S216     02                       1217               1150 
##  2      44638 OcnA    S116     01                       1676               1384.
##  3      47448 FrScA   S216     01                       1232               1116 
##  4      47979 OcnA    S216     01                       1833               1493.
##  5      48797 PhysA   S116     01                       2225               1995.
##  6      51943 FrScA   S216     03                       1222                 70 
##  7      52326 AnPhA   S216     01                       1775               1519.
##  8      52446 PhysA   S116     01                       2225               2198 
##  9      53447 FrScA   S116     01                       1212               1173 
## 10      53475 FrScA   S116     02                       1212                  0 
## # … with 820 more rows, and 6 more variables: proportion_earned <dbl>,
## #   gender <chr>, enrollment_reason <chr>, enrollment_status <chr>,
## #   time_spent <dbl>, time_spent_hours <dbl>

Hint: If you’re curious about how to format the use of the by = part of your code, look up above at how you used this argument to the full_join() function.

What do you notice about the result—the data you joined? In particular, how does it differ from the two data sets from which it was created? Add one or more notes below.

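Join Survey Data

Now let’s join the survey data to joined_data, again using full_join() and the same four keys.

# join together joined_data and survey
data_to_explore <- full_join(joined_data, survey, by = c("student_id", "subject", "semester", "section"))

data_to_explore

If this first attempt stops with an error about incompatible types, it is because student_id is stored as a number in joined_data but as a character string in survey. Convert student_id in joined_data to a character string and run the join again:

joined_data <- joined_data %>%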
  mutate(student_id = as.character(student_id))
data_to_explore <- full_join(joined_data, survey, by = c("student_id", "subject", "semester", "section"))

data_to_explore
## # A tibble: 943 × 34
##    student_id subject semester section total_points_possible total_points_earned
##    <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
##  1 43146      FrScA   S216     02                       1217               1150 
##  2 44638      OcnA    S116     01                       1676               1384.
##  3 47448      FrScA   S216     01                       1232               1116 
##  4 47979      OcnA    S216     01                       1833               1493.
##  5 48797      PhysA   S116     01                       2225               1995.
##  6 51943      FrScA   S216     03                       1222                 70 
##  7 52326      AnPhA   S216     01                       1775               1519.
##  8 52446      PhysA   S116     01                       2225               2198 
##  9 53447      FrScA   S116     01                       1212               1173 
## 10 53475      FrScA   S116     02                       1212                  0 
## # … with 933 more rows, and 28 more variables: proportion_earned <dbl>,
## #   gender <chr>, enrollment_reason <chr>, enrollment_status <chr>,
## #   time_spent <dbl>, time_spent_hours <dbl>, course_id <chr>, int <dbl>,
## #   val <dbl>, percomp <dbl>, tv <dbl>, q1 <dbl>, q2 <dbl>, q3 <dbl>, q4 <dbl>,
## #   q5 <dbl>, q6 <dbl>, q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>,
## #   post_int <dbl>, post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>,
## #   date_y <dttm>, date <dttm>

We’ll revisit joins in our Unit 2 tutorials, but for a quick overview of the different join functions with helpful visuals, visit: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

3. EXPLORE

As highlighted in both DSEIUR and Learning Analytics Goes to School, calculating summary statistics and data visualization are key parts of exploratory data analysis. One goal in this phase is to explore questions that drove the original analysis and develop new questions and hypotheses to test in later stages. Topics addressed in Part 3 include:

  • Data Visualization. We use the {ggplot2} package to create histograms and scatter plots of our data.

  • Table Summaries. We use the {skimr} package to create descriptive statistics for key variables in our data set.

a. Data Visualization

Histograms

We’re now ready to create a faceted plot. Like in the getting started task, we’ll use the ggplot2 package.

The code below creates a histogram with 30 bins—the default number for geom_histogram. Change the number of bins below and note any differences in what you interpret about the data.

data_to_explore %>% 
  ggplot(aes(x = time_spent_hours)) +
  geom_histogram(bins = 30)
## Warning: Removed 232 rows containing non-finite values (stat_bin).

What do you think the ideal number of bins is, with “ideal” meaning the number of bins that helps you to interpret the overall distribution of the values for how much time students spent? (Note: there is no one right or wrong answer here!)
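Besides setting the number of bins, geom_histogram() also accepts a binwidth argument, which sets how wide each bin is in the units of the x-axis (here, hours). The sketch below uses a small made-up data frame, toy_time, so that it runs on its own; the values are hypothetical.

```r
library(ggplot2)
library(tibble)

# Hypothetical hours spent, including one missing value
toy_time <- tibble(time_spent_hours = c(2, 5, 8, 12, 15, 15, 22, 30, 41, NA))

# Option 1: choose the number of bins
plot_bins <- ggplot(toy_time, aes(x = time_spent_hours)) +
  geom_histogram(bins = 5)

# Option 2: choose the width of each bin (10 hours per bin)
plot_binwidth <- ggplot(toy_time, aes(x = time_spent_hours)) +
  geom_histogram(binwidth = 10)

plot_bins
plot_binwidth
```

The missing value in toy_time produces the same kind of "Removed ... rows" warning you saw above; it simply tells you that rows without a value could not be placed in a bin.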

We’ll next be using the facet_wrap() function to create small multiples, or plots that are specific to subsets of your data. These subsets are identified based on another variable in your dataset. For example, the code below uses the built-in mpg dataset to plot the relationship between the displacement of a car’s engine and its highway miles per gallon fuel efficiency.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

The code in the next plot creates individual plots for each class—think compact car or SUV.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  facet_wrap(~class)

Your Turn

In the code below, create a faceted histogram based on the subject of the course. To do so, consider both:

  • What code you used to create the histogram of the time students spent on the course

  • How, in the example above, facet_wrap refers to the variable in that data frame that represents the class of the car—but modifying the code to work for your subject variable

You may also wish to change the color; reflect back to the getting started task for an example of how to do this.

ggplot(data_to_explore, aes(x = time_spent_hours)) + 
  geom_histogram() +
  facet_wrap(~subject)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 232 rows containing non-finite values (stat_bin).

What do you notice about this figure? And what do you wonder? Add a note (or a few notes!) below:

Scatter Plots

Having prepared both of the data sets we joined together, and worked hard to join those data sets, we’re now ready to use this joined data set in our exploration of how the time students spent on the course LMS relates to the number of points they earned throughout the course.

We’ll be using the {ggplot2} package again, but, this time, will be creating a different type of plot.

Run the code below to create a scatter plot of the proportion of points students earned and the number of hours they spent on the course LMS.

ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned)) +
  geom_point()
## Warning: Removed 345 rows containing missing values (geom_point).

What do you notice about this graph? And what do you wonder? How about the code—what do you notice about it (and what do you wonder)? Add one or more of what you see as the most important elements.

Using {ggplot2} makes it efficient to iterate through different versions of similar plots. For instance, we can color the points by a third variable, such as students’ enrollment status, to begin to explore what was going on for students who spent very little time on the course:

ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point()
## Warning: Removed 345 rows containing missing values (geom_point).

Your Turn

We can also create faceted plots, like the one you created in the last learning lab. In the code below, facet the plot by subject.

ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  facet_wrap(~subject)
## Warning: Removed 345 rows containing missing values (geom_point).

You may wish to style your plot. A few ways you can do that are as follows; we’ll discuss more throughout the institute. For each of the following, add them to your plot by adding a plus symbol to the line prior to the line you are adding. For instance, the following code styles the x-axis label of a plot:

ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  xlab("Time Spent (Hours)")
## Warning: Removed 345 rows containing missing values (geom_point).

Your Turn

Try adding (and modifying, if you’d like) any of the following to the faceted plot you created in the code chunk below:

  • xlab("Time Spent (Hours)")

  • ylab("Proportion of Points Earned")

  • scale_color_brewer("Enrollment Status", type = "qual", palette = 3)

  • ggtitle("How Time Spent on Course LMS is Related to Points Earned in the Course")

  • theme(legend.position = "bottom")
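As one possible way to put the pieces in the list above together, the sketch below adds all five styling layers to a faceted scatter plot. It uses a small made-up data frame, toy_plot_data, so that it runs on its own; the values, including the enrollment status labels, are hypothetical. With your own data, you would replace toy_plot_data with data_to_explore.

```r
library(ggplot2)
library(tibble)

# Hypothetical data with the same column names as data_to_explore
toy_plot_data <- tibble(
  time_spent_hours = c(5, 12, 30, 45, 8, 22),
  proportion_earned = c(0.40, 0.65, 0.90, 0.95, 0.55, 0.80),
  enrollment_status = c("Enrolled", "Dropped", "Enrolled",
                        "Enrolled", "Withdrawn", "Enrolled"),
  subject = c("OcnA", "OcnA", "PhysA", "PhysA", "FrScA", "FrScA")
)

# Each styling layer is added with a plus symbol at the end of the prior line
styled_plot <- ggplot(toy_plot_data,
                      aes(x = time_spent_hours,
                          y = proportion_earned,
                          color = enrollment_status)) +
  geom_point() +
  facet_wrap(~subject) +
  xlab("Time Spent (Hours)") +
  ylab("Proportion of Points Earned") +
  scale_color_brewer("Enrollment Status", type = "qual", palette = 3) +
  ggtitle("How Time Spent on Course LMS is Related to Points Earned in the Course") +
  theme(legend.position = "bottom")

styled_plot
```

You do not need every layer; try removing or changing one at a time to see what each contributes.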

Once you have settled on a plot you are happy with (for now!), add a sentence or two interpreting your graph (like you were describing it within a manuscript):

b. Table Summaries

At this point, we should have quite the comprehensive data set, including a) the time students spent in the course LMS and other information about them, such as why they enrolled in the course, b) their academic achievement, and c) their responses to the survey.

We’ll explore our data in two ways—by creating:

  1. Descriptive statistics for key variables in our data
  2. A correlation matrix for key variables that are numbers

We’ll take these in turn, considering two different ways to create correlation tables that may be suited better to particular tasks depending on one’s goals.

Skimr Package

An efficient package for creating descriptive statistics when your goal is to understand your data internally (rather than to create a table for an external-to-the-research-team audience, like for a journal article) is the {skimr} package. A key feature of the {skimr} package is that it works well with the {tidyverse} packages we are using: it takes data frames as input, and returns data frames as output, which means we can manipulate them with {tidyverse} functions like select(), filter(), and arrange(), for example.

Let’s load the {skimr} package:

library(skimr)

The challenge here is not the complexity of the skim() function, per se, but comprehending the terminology. In the code chunk below:

  • Pass to the skim() function a single argument (recall from our tutorials last week that functions have names and arguments!)

  • That single argument is the data frame (aka in tidyverse parlance, a tibble) for which you are aiming to calculate descriptive statistics

Run the following code to “skim” our data_to_explore tibble:

skim(data_to_explore)
Data summary

Name                      data_to_explore
Number of rows            943
Number of columns         34
_______________________
Column type frequency:
  character               8
  numeric                 23
  POSIXct                 3
________________________
Group variables           None

Variable type: character

skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
student_id                 0           1.00    2    6      0       879           0
subject                    0           1.00    4    5      0         5           0
semester                   0           1.00    4    4      0         4           0
section                    0           1.00    2    2      0         4           0
gender                   227           0.76    1    1      0         2           0
enrollment_reason        227           0.76    5   34      0         5           0
enrollment_status        227           0.76    7   17      0         3           0
course_id                281           0.70   12   13      0        36           0

Variable type: numeric

skim_variable          n_missing  complete_rate     mean       sd       p0      p25      p50      p75     p100  hist
total_points_possible        226           0.76  1619.55   387.12  1212.00  1217.00  1676.00  1791.00  2425.00  ▇▂▆▁▃
total_points_earned          226           0.76  1229.98   510.64     0.00  1002.50  1177.13  1572.45  2413.50  ▂▂▇▅▂
proportion_earned            226           0.76     0.76     0.25     0.00     0.72     0.86     0.92     1.01  ▁▁▁▃▇
time_spent                   232           0.75  1828.80  1363.13     0.45   895.57  1559.97  2423.94  8870.88  ▇▅▁▁▁
time_spent_hours             232           0.75    30.48    22.72     0.01    14.93    26.00    40.40   147.85  ▇▅▁▁▁
int                          293           0.69     4.30     0.60     1.80     4.00     4.40     4.80     5.00  ▁▁▂▆▇
val                          287           0.70     3.75     0.75     1.00     3.33     3.67     4.33     5.00  ▁▁▆▇▆
percomp                      288           0.69     3.64     0.69     1.50     3.00     3.50     4.00     5.00  ▁▁▇▃▃
tv                           292           0.69     4.07     0.59     1.00     3.71     4.12     4.46     5.00  ▁▁▂▇▇
q1                           285           0.70     4.34     0.66     1.00     4.00     4.00     5.00     5.00  ▁▁▁▇▇
q2                           285           0.70     3.66     0.93     1.00     3.00     4.00     4.00     5.00  ▁▂▆▇▃
q3                           286           0.70     3.31     0.85     1.00     3.00     3.00     4.00     5.00  ▁▂▇▅▂
q4                           289           0.69     4.35     0.80     1.00     4.00     5.00     5.00     5.00  ▁▁▁▆▇
q5                           286           0.70     4.28     0.69     1.00     4.00     4.00     5.00     5.00  ▁▁▁▇▆
q6                           285           0.70     4.05     0.80     1.00     4.00     4.00     5.00     5.00  ▁▁▃▇▅
q7                           286           0.70     3.96     0.85     1.00     3.00     4.00     5.00     5.00  ▁▁▅▇▆
q8                           286           0.70     4.35     0.65     1.00     4.00     4.00     5.00     5.00  ▁▁▁▇▇
q9                           286           0.70     3.55     0.92     1.00     3.00     4.00     4.00     5.00  ▁▂▇▇▃
q10                          285           0.70     4.17     0.87     1.00     4.00     4.00     5.00     5.00  ▁▁▃▇▇
post_int                     848           0.10     3.88     0.94     1.00     3.50     4.00     4.50     5.00  ▁▁▃▇▇
post_uv                      848           0.10     3.48     0.99     1.00     3.00     3.67     4.00     5.00  ▂▂▅▇▅
post_tv                      848           0.10     3.71     0.90     1.00     3.29     3.86     4.29     5.00  ▁▂▃▇▆
post_percomp                 848           0.10     3.47     0.88     1.00     3.00     3.50     4.00     5.00  ▁▂▂▇▂

Variable type: POSIXct

skim_variable  n_missing  complete_rate  min                  max                  median               n_unique
date_x               393           0.58  2015-09-02 15:40:00  2016-05-24 15:53:00  2015-10-01 15:57:30       536
date_y               848           0.10  2015-09-02 15:31:00  2016-01-22 15:43:00  2016-01-04 13:25:00        95
date                 834           0.12  2017-01-23 13:14:00  2017-02-13 13:00:00  2017-01-25 18:43:00       107

Note: If you are having difficulty viewing your data in the code chunk, try clicking the icon in the output that looks like a spreadsheet with an arrow on it to expand your output in a separate window.

What do you notice about the output? These observations might pertain to the format of the output or its values (e.g., what the mean for the val variable is). Note one or two of these noticings or wonderings below:

As we noted earlier, the {skimr} package works nicely with other {tidyverse} functions.

Hint: For help, also consider running ?skim() in the console to view the documentation for the function.
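Because skim() returns a data frame, its output can be passed on to other {tidyverse} functions. The sketch below is one way this might look, using a small made-up data frame, toy_data, with hypothetical values. It converts the skim output to a plain tibble with as_tibble() and then keeps only a few summary columns for the numeric variables; the column name numeric.mean follows the naming that skimr uses for type-specific statistics.

```r
library(skimr)
library(dplyr)
library(tibble)

# Hypothetical data with one missing value in hours
toy_data <- tibble(hours = c(10, 20, 30, NA),
                   proportion = c(0.5, 0.7, 0.9, 0.8),
                   subject = c("OcnA", "OcnA", "PhysA", "PhysA"))

# Skim the data, then treat the result like any other data frame
toy_summary <- toy_data %>%
  skim() %>%
  as_tibble() %>%
  filter(skim_type == "numeric") %>%
  select(skim_variable, n_missing, numeric.mean) %>%
  arrange(desc(n_missing))

toy_summary
```

Converting to a tibble first is a cautious choice here; it sidesteps any special printing behavior of the skim output when columns are removed.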

Your Turn

Recall from the Week 3 tutorials and exercises how we isolated data using the select() and filter() functions. In the code chunk below, look at descriptives for just proportion_earned, time_spent and gender, but only for the “OcnA” and “PhysA” subjects.

Can you modify the code below to do this?

data_to_explore %>% 
  select(proportion_earned, time_spent, gender, subject) %>% 
  filter(subject == "OcnA" | subject == "PhysA") %>%
  skim()
Data summary

Name                      Piped data
Number of rows            249
Number of columns         4
_______________________
Column type frequency:
  character               2
  numeric                 2
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
gender                48           0.81    1    1      0         2           0
subject                0           1.00    4    5      0         2           0

Variable type: numeric

skim_variable      n_missing  complete_rate     mean       sd    p0     p25      p50      p75     p100  hist
proportion_earned         48           0.81     0.78     0.24  0.00    0.73     0.86     0.94     1.00  ▁▁▁▃▇
time_spent                48           0.81  1828.56  1374.13  0.58  943.07  1601.13  2356.88  8870.88  ▇▅▁▁▁
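When you want to keep rows that match any of several values, the filter() step can also be written with the %in% operator instead of chaining conditions with |. The sketch below compares the two approaches on a small made-up data frame, toy_subjects; the values, including the subject code "BioA", are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data with several subjects
toy_subjects <- tibble(subject = c("OcnA", "PhysA", "BioA", "FrScA", "OcnA"),
                       proportion_earned = c(0.8, 0.9, 0.7, 0.6, 0.5))

# Approach used above: combine two conditions with "or"
with_or <- toy_subjects %>%
  filter(subject == "OcnA" | subject == "PhysA")

# Alternative: check membership in a vector of values
with_in <- toy_subjects %>%
  filter(subject %in% c("OcnA", "PhysA"))

# Compare the two results
all.equal(with_or, with_in)
```

The %in% form tends to be easier to read and extend as the list of values grows.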

We noted earlier that this output is best for internal use. This is because the output is rich, but not well-suited to exporting to a table that you add, for instance, to a Google Docs or Microsoft Word manuscript. Of course, these values can be entered manually into a table, but we’ll also discuss ways later on to create tables that are ready, or nearly ready, to be added directly to manuscripts.

If you are curious about doing more with {skimr}, check out: <https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html>

4. MODEL

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful. In Part 4 we will learn how to:

a. Create a Correlation Matrix

As highlighted in Macfadyen and Dawson (2010),

There are two efficient ways to create correlation matrices, one that is best for internal use, and one that is best for inclusion in a manuscript.

Corrr Package

First, the {corrr} package provides a way to create a correlation matrix in a {tidyverse}-friendly way. As with the {skimr} package, it can take as little as a line of code to create a correlation matrix. If you are not familiar with the term, a correlation matrix is a table that presents how each variable is related to every other variable.

Run the following code to load the {corrr} package:

library(corrr)
## 
## Attaching package: 'corrr'
## The following object is masked from 'package:skimr':
## 
##     focus

Time Spent and Course Grade

Since the primary purpose of this case study is to investigate whether time spent in an online course is predictive of student achievement, let’s first take a look and see if there is a simple correlation between time spent and student achievement.

Run the following code to create a simple correlation matrix using the correlate() function from the {corrr} package:

data_to_explore %>% 
  select(proportion_earned, time_spent_hours) %>%
  correlate()
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## # A tibble: 2 × 3
##   term              proportion_earned time_spent_hours
##   <chr>                         <dbl>            <dbl>
## 1 proportion_earned            NA                0.438
## 2 time_spent_hours              0.438           NA
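For comparison, a similar matrix can be computed with the cor() function built into R. The sketch below is only an illustration: it uses a small made-up data frame (toy) rather than data_to_explore, and it sets the same method and missing-data handling that the correlate() message reports. One visible difference is that cor() puts 1 on the diagonal, where correlate() shows NA.

```r
# Toy data (made-up values), including one missing grade
toy <- data.frame(
  proportion_earned = c(0.55, 0.70, 0.82, NA, 0.91, 0.64),
  time_spent_hours  = c(12, 25, 40, 33, 58, 20)
)

# Pearson correlations, using every pair of non-missing observations
cor_mat <- cor(toy, method = "pearson", use = "pairwise.complete.obs")

# Round for easier reading
round(cor_mat, 2)
```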

For the purpose of printing, and as a quick aside, the {corrr} package also has a nice fashion() function for converting a correlation data frame into a matrix with the correlations cleanly formatted (leading zeros removed; spaced for signs) and the diagonal (or any NA) left blank.

Run the following code to try it out:

data_to_explore %>% 
  select(proportion_earned, time_spent_hours, int, val, percomp) %>% 
  correlate() %>% 
  rearrange() %>%
  shave() %>%
  fashion()
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
##                term time_spent_hours proportion_earned  int  val percomp
## 1  time_spent_hours                                                     
## 2 proportion_earned              .44                                    
## 3               int              .08               .14                  
## 4               val              .10               .02  .53             
## 5           percomp              .05               .08  .56  .48

Your Turn

In the code chunk below, select 3-4 numeric variables in addition to time_spent_hours that you think may be related to student achievement, i.e. proportion_earned, and run a simple correlation.

Hint: One key is to correlate only numeric variables. Note that while some numeric variables can technically be used, it is likely not sensible to correlate all of the variables; some—for instance, the section variable—are not very sensible to correlate!

What did you find? Were your selected variables related to student achievement? Your observations might pertain to the format of the output or its values (e.g., how strongly the val variable is correlated with proportion_earned). Note one or two of these noticings or wonderings below:

If you are interested in learning more about the {corrr} package, visit: https://corrr.tidymodels.org

APA Formatted Tables

As we noted earlier, the {skimr} package works nicely with other {tidyverse} functions. While {corrr} is a nice package to quickly create a correlation matrix, you may wish to create one that is ready to be added directly to a dissertation or journal article. {apaTables} is great for creating more formal forms of output that can be added directly to an APA-formatted manuscript; it also has functionality for regression and other types of model output. It is not as friendly to {tidyverse} functions; first, we need to select only the variables we wish to correlate.

Then, we can use that subset of the variables as the argument to the apa.cor.table() function.

Run the following code to create a subset of the larger data_to_explore data frame with the variables you wish to correlate, then create a correlation table using apa.cor.table().

library(apaTables)

data_to_explore_subset <- data_to_explore %>% 
  select(time_spent_hours, proportion_earned, int)

apa.cor.table(data_to_explore_subset)
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable             M     SD    1           2         
##   1. time_spent_hours  30.48 22.72                       
##                                                          
##   2. proportion_earned 0.76  0.25  .44**                 
##                                    [.37, .50]            
##                                                          
##   3. int               4.30  0.60  .08         .14**     
##                                    [-.01, .16] [.06, .22]
##                                                          
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
## 

This may look nice, but how do you actually add it to a dissertation or journal article that you might be interested in publishing? Read the documentation for apa.cor.table() by running ?apa.cor.table in the console. Look through the documentation and examples to understand how to output a file with the formatted correlation table, and then run the code to do that with your subset of the data_to_explore data frame.

apa.cor.table(data_to_explore_subset, filename = "cor-table.doc")
## 
## 
## Means, standard deviations, and correlations with confidence intervals
##  
## 
##   Variable             M     SD    1           2         
##   1. time_spent_hours  30.48 22.72                       
##                                                          
##   2. proportion_earned 0.76  0.25  .44**                 
##                                    [.37, .50]            
##                                                          
##   3. int               4.30  0.60  .08         .14**     
##                                    [-.01, .16] [.06, .22]
##                                                          
## 
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations 
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
## 

You should now see a new Word document in your project folder called cor-table.doc. Click on that and you’ll be prompted to download from your browser.

b. Predict Academic Achievement

For the purpose of this learning lab, let’s consider the proportion_earned variable to be our dependent, or outcome, variable.

You may be new to linear regression models, or you may have a lot of experience. In brief, a linear regression model involves estimating the relationships between one or more independent variables and one dependent variable. Mathematically, it can be written like the following.

\[ \operatorname{dependentvar} = \beta_{0} + \beta_{1}(\operatorname{independentvar}) + \epsilon \]

Here, dependentvar is predicted from independentvar using two coefficients, or estimated values that help to explain the dependent variable. The first coefficient, \(\beta_0\), is the intercept. This coefficient tells us what the estimated value of the dependent variable is when the independent variable (independentvar) is equal to 0. The other coefficient, \(\beta_1\), or the slope, represents the change in the estimated value of the dependent variable associated with a one-unit increase in the independent variable. The final term, \(\epsilon\), is the error, or the part of the dependent variable that the model does not explain.
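To make the equation more concrete, here is a minimal sketch that simulates data from a known line and then checks that lm() recovers the intercept and slope. Everything in it is an assumption made for illustration: the true values (0.60 and 0.005), the sample size, and the amount of noise are invented and are not taken from our course data.

```r
set.seed(42)  # make the simulation reproducible

# Simulate data from a known line: intercept 0.60, slope 0.005 (both made up)
independentvar <- runif(500, min = 0, max = 80)
dependentvar <- 0.60 + 0.005 * independentvar + rnorm(500, mean = 0, sd = 0.05)

# Estimate the intercept (beta_0) and slope (beta_1) from the simulated data
sim_model <- lm(dependentvar ~ independentvar)
coef(sim_model)
```

The estimates should land close to, but not exactly on, the values used to simulate the data. The gap comes from the random error term.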

Does Time Spent Predict Grade Earned?

Let’s consider a simple concrete example. We’ll use the lm() function in R to estimate a linear regression model.

The following code estimates a model in which proportion_earned, the proportion of points students earned, is the dependent variable. It is predicted by one independent variable, time_spent_hours, the number of hours students spent in the course.

lm(proportion_earned ~ time_spent_hours, 
   data = data_to_explore)
## 
## Call:
## lm(formula = proportion_earned ~ time_spent_hours, data = data_to_explore)
## 
## Coefficients:
##      (Intercept)  time_spent_hours  
##         0.624306          0.004792

Let’s take a look at the output.

We can see that the intercept is estimated at 0.62. This tells us that when students’ time spent in the online course is equal to zero, their predicted proportion of points earned is 0.62. That is not such a great grade, but also not surprising! For every one-unit, or one-hour, increase in time spent in the course, their estimated proportion of points earned increases by 0.0048. So if a student spent, for instance, 40 hours on the course, their estimated final grade would be 0.624 + (0.0048 * 40), or around 0.82, or 82%. A pretty solid B-!
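The arithmetic above can be checked directly in R. The sketch below copies the two coefficients from the lm() output and wraps the calculation in a small helper function; the name predict_grade is hypothetical and used only for this illustration.

```r
# Coefficients copied from the lm() output above
intercept <- 0.624306
slope     <- 0.004792

# Hypothetical helper: predicted proportion earned for a given number of hours
predict_grade <- function(hours) {
  intercept + slope * hours
}

predict_grade(0)   # the intercept itself
predict_grade(40)  # about 0.82
```

If you save the fitted model to an object, the built-in predict() function performs the same calculation without copying the coefficients by hand.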

How about interest in science?

We can add additional predictor variables by separating variables with a plus symbol. Run the following code to add int, students’ self-reported interest in science, to our linear model:

lm(proportion_earned ~ time_spent_hours + int, 
   data = data_to_explore)
## 
## Call:
## lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)
## 
## Coefficients:
##      (Intercept)  time_spent_hours               int  
##         0.449657          0.004255          0.046283

We can see that the intercept is now estimated at 0.45, which tells us that when students’ time spent and interest are both equal to zero, they are, unsurprisingly, likely to fail the course. Note that the estimate for interest in science is 0.046, so for every one-unit increase in int, we should expect about a 5 percentage point increase in their grade, holding time spent constant.

We can save the output of the function to an object—let’s say m1, standing for model 1. We can then use the summary() function built into R to view a much more feature-rich summary of the estimated model.

m1 <- lm(proportion_earned ~ time_spent_hours + int, data = data_to_explore)

summary(m1)
## 
## Call:
## lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66705 -0.07836  0.05049  0.14695  0.35766 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.449657   0.066488   6.763 3.54e-11 ***
## time_spent_hours 0.004255   0.000410  10.378  < 2e-16 ***
## int              0.046282   0.015364   3.012  0.00271 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2142 on 536 degrees of freedom
##   (404 observations deleted due to missingness)
## Multiple R-squared:  0.1859, Adjusted R-squared:  0.1828 
## F-statistic: 61.18 on 2 and 536 DF,  p-value: < 2.2e-16

There is a lot to unpack in this output, but for now the most important values to look at are those in the Estimate column, which represent the intercept and slopes for your linear regression model.

Note that the estimate for time_spent_hours is now about 0.0043 and that the estimate for int, interest in science, is about 0.046. Both are statistically significant.
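If you would like to work with the Estimate column directly rather than reading it off the printed summary, the coefficient table can be pulled out as a matrix with coef(summary(...)). The sketch below is a self-contained illustration that fits a model to made-up data (toy, with invented true values), so the numbers will not match those for m1.

```r
set.seed(1)  # make the made-up data reproducible

# Made-up data with two predictors
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(100, sd = 0.1)

toy_model <- lm(y ~ x1 + x2, data = toy)

# The coefficient table from summary() is a matrix with one column per statistic
coef_table <- coef(summary(toy_model))
colnames(coef_table)

# Just the Estimate column: the intercept and the two slopes
coef_table[, "Estimate"]
```

The same two calls should work on m1, for example coef(summary(m1))[, "Estimate"].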

Do average students earn an average grade?

Now let’s consider the mean values for each of these predictors. Recall from last week’s tutorials that the summarize() function from the {dplyr} package is used to create summary statistics such as the mean, standard deviation, or the minimum or maximum of a value. At its core, think of summarize() as a function that returns a single value (whether it’s a mean, median, or standard deviation) that summarizes a single column.

Let’s use the summarize() function to calculate the means for time spent and interest in science. We add the argument na.rm = TRUE to tell R to ignore missing, or NA, values and to calculate the summary statistic using only the non-missing values.
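To see why na.rm = TRUE matters, here is a small base R sketch using a made-up vector (scores) with one missing value. Without the argument, a single NA makes the whole mean NA.

```r
scores <- c(4.2, NA, 4.6, 3.8)  # made-up values with one missing entry

mean(scores)                # NA, because one value is missing
mean(scores, na.rm = TRUE)  # 4.2, the mean of the three non-missing values
```

The same argument is used inside the summarize() call that follows.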

data_to_explore %>% 
  summarize(mean_interest = mean(int, na.rm = TRUE),
            mean_time = mean(time_spent_hours, na.rm = TRUE))
## # A tibble: 1 × 2
##   mean_interest mean_time
##           <dbl>     <dbl>
## 1          4.30      30.5

The mean value for interest is quite high. If we multiply the estimated coefficient for interest, 0.046, by the mean interest across all of the students, 4.30, we find that interest contributes about 0.046 × 4.30, or 0.198, to the predicted proportion of points earned. Similarly, the average number of hours spent contributes about 0.0043 × 30.48, or 0.131.

If we add both 0.198 and 0.131 to the intercept, 0.450, that equals 0.779, or about 78%. In other words, a student with average interest in science who spent an average amount of time in the course is predicted to earn a pretty average grade.
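The same calculation can be written out in R. This sketch copies the coefficients from the summary(m1) output and the means from the output above (the mean for time spent, 30.48, is taken from the APA correlation table). Because it uses the unrounded coefficients, the result may differ slightly in the third decimal place from a hand calculation with rounded values.

```r
# Coefficients copied from the summary(m1) output above
b0     <- 0.449657  # intercept
b_time <- 0.004255  # time_spent_hours
b_int  <- 0.046282  # int

# Means copied from the output above
mean_time <- 30.48
mean_int  <- 4.30

# Predicted proportion earned for a student who is average on both predictors
predicted <- b0 + b_time * mean_time + b_int * mean_int
round(predicted, 2)  # 0.78
```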

Finally, similar to our APA formatted correlation table above, we can use the {apaTables} package to create a nice regression table that could be used for later publication:

apa.reg.table(m1, filename = "lm-table.doc")
## 
## 
## Regression results using proportion_earned as the criterion
##  
## 
##         Predictor      b     b_95%_CI beta  beta_95%_CI sr2  sr2_95%_CI     r
##       (Intercept) 0.45** [0.32, 0.58]                                        
##  time_spent_hours 0.00** [0.00, 0.01] 0.41 [0.33, 0.48] .16  [.11, .22] .41**
##               int 0.05** [0.02, 0.08] 0.12 [0.04, 0.19] .01 [-.00, .03] .15**
##                                                                              
##                                                                              
##                                                                              
##              Fit
##                 
##                 
##                 
##      R2 = .186**
##  95% CI[.13,.24]
##                 
## 
## Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
## b represents unstandardized regression weights. beta indicates the standardized regression weights. 
## sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
## Square brackets are used to enclose the lower and upper limits of a confidence interval.
## * indicates p < .05. ** indicates p < .01.
## 

Your Turn

Below, estimate a different regression model with at least two predictor variables, save it as m2, and view a summary() of the results:

m2 <- lm(proportion_earned ~ int + time_spent_hours + gender + val, data = data_to_explore)

summary(m2)
## 
## Call:
## lm(formula = proportion_earned ~ int + time_spent_hours + gender + 
##     val, data = data_to_explore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64858 -0.07638  0.04905  0.15026  0.37673 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.4833104  0.0698973   6.915 1.35e-11 ***
## int               0.0697466  0.0184026   3.790 0.000168 ***
## time_spent_hours  0.0042855  0.0004139  10.353  < 2e-16 ***
## genderM          -0.0032031  0.0205245  -0.156 0.876045    
## val              -0.0353234  0.0147223  -2.399 0.016769 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2135 on 531 degrees of freedom
##   (407 observations deleted due to missingness)
## Multiple R-squared:  0.1935, Adjusted R-squared:  0.1874 
## F-statistic: 31.84 on 4 and 531 DF,  p-value: < 2.2e-16

Add a brief note or two interpreting the above model (m2):

5. COMMUNICATE

The final(ish) step in our workflow/process is sharing the results of analysis with a wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating with education stakeholders what you have learned through analysis:

  1. Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”

  2. Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.

  3. Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

For Unit 1 we will keep it simple. In the code chunk below, select a chart, table or model created above (or create an entirely new one based on a new analysis) that you think an education stakeholder might find interesting. Beneath the code chunk, write a very brief narrative to accompany your data product.

My First Data Product (Change Me)

INSERT NARRATIVE HERE

Congratulations!

You’ve completed the first case study! To “turn in” your work, you can click the “Knit” icon at the top of the file, or click the dropdown arrow next to it and select “Knit to HTML.” This will create a report in your Files pane that you can open or share, and that serves as a record of your completed assignment and its output.

Macfadyen, Leah P., and Shane Dawson. 2010. “Mining LMS Data to Develop an Early Warning System for Educators: A Proof of Concept.” Computers & Education 54 (2): 588–99. https://doi.org/10.1016/j.compedu.2009.09.008.